Abstract

We empirically investigate the best trade-off between sparse and
uniformly-weighted multiple kernel learning (MKL) using the elastic-net
regularization on real and simulated datasets. We find that the best trade-off
parameter depends not only on the sparsity of the true kernel-weight spectrum
but also on the linear dependence among kernels and the number of samples.

1 Introduction 

Sparse multiple kernel learning (MKL; see [9, 12, 2]) is often outperformed by
the simple uniformly-weighted MKL in terms of accuracy [3, 5]. However, the
sparsity offered by sparse MKL helps in understanding which features are
useful and can also save a lot of computation in practice. In this paper we
investigate this trade-off between sparsity and accuracy using an elastic-net
type regularization term, which is a smooth interpolation between the sparse
($\ell_1$-) MKL and the uniformly-weighted MKL. In addition, we extend the recently
proposed SpicyMKL algorithm [15] for efficient optimization in the proposed
elastic-net regularized MKL framework. Based on real and simulated MKL
problems with more than 1000 kernels, we show that:

1. Sparse MKL indeed suffers from poor accuracy when the number of
samples is small.

2. As the number of samples grows larger, the difference in the accuracy
between sparse MKL and uniformly-weighted MKL becomes smaller.

3. Often the best accuracy is obtained in between the sparse and
uniformly-weighted MKL. This can be explained by the dependence among candidate
kernels having neighboring kernel parameter values.



*Both authors contributed equally to this work.

2 Method



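Let us assume that we are provided with $M$ reproducing kernel Hilbert spaces
(RKHSs) equipped with kernel functions $k_m : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ ($m = 1, \ldots, M$) and
the task is to learn a classifier from $N$ training examples $\{(x_i, y_i)\}_{i=1}^N$, where
$x_i \in \mathcal{X}$ and $y_i \in \{-1, +1\}$ ($i = 1, \ldots, N$). We formulate this problem as the
following minimization problem:

$$
\min_{f_m \in \mathcal{H}_m\,(m=1,\ldots,M),\; b \in \mathbb{R}} \quad
\sum_{i=1}^N \ell\Big(\sum_{m=1}^M f_m(x_i) + b,\; y_i\Big)
+ C \sum_{m=1}^M \Big((1-\lambda)\|f_m\|_{\mathcal{H}_m} + \frac{\lambda}{2}\|f_m\|_{\mathcal{H}_m}^2\Big), \tag{1}
$$
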
where in the first term, $f_m$ is a member of the $m$-th RKHS $\mathcal{H}_m$, $b$ is a bias term,
and $\ell$ is a loss function; in this paper we use the logistic loss function. The second
term is a regularization term and is a mixture of $\ell_1$- and $\ell_2$-regularization terms.
The constant $C$ ($> 0$) determines the overall trade-off between the loss term and
the regularization terms. Here the first regularization term is the linear sum of
RKHS norms, which is known to make only a few $f_m$'s non-zero (i.e., sparse; see
[11, 15, 1]); the second regularization term is the squared sum of RKHS norms.
The two regularization terms are balanced by the constant $\lambda$ ($0 \le \lambda \le 1$);
$\lambda = 0$ corresponds to sparse ($\ell_1$-) MKL and $\lambda = 1$ corresponds to
uniformly-weighted MKL.
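The interpolation performed by the regularization term can be illustrated with a minimal sketch. The function name `elastic_net_penalty` and the example norm values are hypothetical and only serve to show how the two end points of the trade-off parameter behave.

```python
import numpy as np

def elastic_net_penalty(norms, lam):
    """Mixed penalty sum_m ((1 - lam) * ||f_m|| + lam / 2 * ||f_m||^2)."""
    norms = np.asarray(norms, dtype=float)
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return float(np.sum((1.0 - lam) * norms + 0.5 * lam * norms ** 2))

# Hypothetical RKHS norms of three components f_1, f_2, f_3.
norms = [0.0, 2.0, 1.0]

sparse_end = elastic_net_penalty(norms, 0.0)    # plain sum of norms (l1-type)
uniform_end = elastic_net_penalty(norms, 1.0)   # half the sum of squared norms
midpoint = elastic_net_penalty(norms, 0.5)      # equal mixture of the two
```

At lam = 0 the penalty is the non-squared sum of norms, which is the term that can drive whole components to zero; at lam = 1 only the smooth squared term remains.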

Due to the representer theorem (see [13]), the solution of the above
minimization problem (1) takes the form $f_m(x) = \sum_{i=1}^N k_m(x, x_i)\alpha_{i,m}$ ($m = 1, \ldots, M$);
therefore we can equivalently solve the following finite-dimensional minimization
problem:

$$
\min_{\alpha_m \in \mathbb{R}^N\,(m=1,\ldots,M),\; b \in \mathbb{R}} \quad
L\Big(\sum_{m=1}^M K_m\alpha_m + b\mathbf{1}\Big)
+ C \sum_{m=1}^M \Big((1-\lambda)\|\alpha_m\|_{K_m} + \frac{\lambda}{2}\|\alpha_m\|_{K_m}^2\Big), \tag{2}
$$

where $K_m \in \mathbb{R}^{N \times N}$ is the $m$-th Gram matrix, $\alpha_m = (\alpha_{1,m}, \ldots, \alpha_{N,m})^\top$ is the
weight vector for the $m$-th kernel, and $\mathbf{1} \in \mathbb{R}^N$ is a vector of all ones; in addition,
$L(z) = \sum_{i=1}^N \ell(z_i, y_i)$. Moreover, we define $\|\alpha_m\|_{K_m} = \sqrt{\alpha_m^\top K_m \alpha_m}$.
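As a rough illustration, the finite-dimensional objective can be evaluated as in the sketch below. It assumes the usual form of the logistic loss, log(1 + exp(-y z)); the function names and the random toy inputs are hypothetical, and this is only an objective evaluation, not the SpicyMKL solver.

```python
import numpy as np

def logistic_loss(z, y):
    # L(z) = sum_i log(1 + exp(-y_i * z_i)), computed stably via logaddexp.
    return float(np.sum(np.logaddexp(0.0, -y * z)))

def k_norm(K, a):
    # ||a||_K = sqrt(a^T K a); clip tiny negative values caused by round-off.
    return float(np.sqrt(max(a @ K @ a, 0.0)))

def objective(Ks, alphas, b, y, C, lam):
    # Loss of the summed kernel expansions plus the elastic-net penalty.
    z = sum(K @ a for K, a in zip(Ks, alphas)) + b
    norms = np.array([k_norm(K, a) for K, a in zip(Ks, alphas)])
    reg = np.sum((1.0 - lam) * norms + 0.5 * lam * norms ** 2)
    return logistic_loss(z, y) + C * float(reg)

# Hypothetical toy problem: N = 5 samples, M = 3 linear-kernel Gram matrices.
rng = np.random.default_rng(0)
N, M = 5, 3
feats = [rng.standard_normal((N, 2)) for _ in range(M)]
Ks = [F @ F.T for F in feats]                    # positive semi-definite
alphas = [rng.standard_normal(N) for _ in range(M)]
y = np.where(rng.standard_normal(N) > 0, 1.0, -1.0)
value = objective(Ks, alphas, b=0.1, y=y, C=1.0, lam=0.5)
```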

The minimization problem (1) is connected to the commonly used "learning
the kernel-weights" formulation of MKL in the following way. First let us define
$g(x) = (1-\lambda)\sqrt{x} + \frac{\lambda}{2}x$ for $x \ge 0$ and $g(x) = -\infty$ for $x < 0$. Since $g$ is a concave
function, it can be linearly upper-bounded as $g(x) \le xy - g^*(y)$, where $g^*(y)$ is
the concave conjugate of $g(x)$. Thus substituting $x = \|\alpha_m\|_{K_m}^2$ and $y = \frac{1}{2\beta_m}$
for $m = 1, \ldots, M$ in Eq. (2), we have:
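
$$
(1-\lambda)\|\alpha_m\|_{K_m} + \frac{\lambda}{2}\|\alpha_m\|_{K_m}^2
\;\le\; \frac{\|\alpha_m\|_{K_m}^2}{2\beta_m} - g^*\Big(\frac{1}{2\beta_m}\Big),
$$

where

$$
g^*\Big(\frac{1}{2\beta_m}\Big) = -\frac{1}{2}\,\frac{(1-\lambda)^2\beta_m}{1-\lambda\beta_m}.
$$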

Minimizing the above expression with respect to $\alpha_m$ while keeping the loss term unchanged
(i.e., $\sum_{m=1}^M K_m\alpha_m = z$ for some $z$), we have $\alpha_m = \beta_m\alpha^*$, and finally we can
rewrite Eq. (2) as follows:

$$
\min_{\alpha^* \in \mathbb{R}^N,\; b \in \mathbb{R},\; \beta \in \mathbb{R}^M} \quad
L\big(K(\beta)\alpha^* + b\mathbf{1}\big)
+ \frac{C}{2}\Big(\alpha^{*\top}K(\beta)\alpha^* + \sum_{m=1}^M \tilde{g}(\beta_m)\Big),
$$

where $K(\beta) = \sum_{m=1}^M \beta_m K_m$ and $\tilde{g}(\beta_m) = -2g^*(1/(2\beta_m))$. Therefore the
problem above is equivalent to learning the decision function with a combined kernel $K(\beta)$
with the Tikhonov regularization on the kernel weights $\beta_m$. Note that $\tilde{g}(\beta) = \beta$
($\ell_1$-MKL) if $\lambda = 0$ and $\tilde{g}(\beta)$ approaches the indicator function of the closed
interval $[0, 1]$ in the limit $\lambda \to 1$ (uniformly-weighted MKL). In this paper we
call $\beta = (\beta_m)_{m=1}^M$ a kernel-weight spectrum.
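The variational bound behind this reformulation can be checked numerically. The sketch below uses the closed form of the conjugate stated above, so that the penalty on a kernel weight, $-2g^*(1/(2\beta))$, equals $(1-\lambda)^2\beta/(1-\lambda\beta)$. The function names and the test values (lam = 0.5, x = 4) are hypothetical, and the tight weight `beta_star` is worked out here from the stationarity condition rather than quoted from the text.

```python
import numpy as np

def g(x, lam):
    # g(x) = (1 - lam) * sqrt(x) + lam / 2 * x, for x >= 0.
    return (1.0 - lam) * np.sqrt(x) + 0.5 * lam * x

def g_tilde(beta, lam):
    # -2 g*(1 / (2 beta)) = (1 - lam)^2 beta / (1 - lam beta), 0 <= beta < 1 / lam.
    return (1.0 - lam) ** 2 * beta / (1.0 - lam * beta)

def upper_bound(x, beta, lam):
    # x * y - g*(y) with y = 1 / (2 beta).
    return x / (2.0 * beta) + 0.5 * g_tilde(beta, lam)

lam = 0.5
x = 4.0                                   # plays the role of a squared norm
betas = np.linspace(0.05, 1.9, 200)       # stays inside 0 < beta < 1 / lam = 2
gap = upper_bound(x, betas, lam) - g(x, lam)

# Weight at which the bound is tight (from setting the derivative to zero).
beta_star = np.sqrt(x) / ((1.0 - lam) + lam * np.sqrt(x))
tight_gap = upper_bound(x, beta_star, lam) - g(x, lam)
```

The gap is non-negative on the whole grid and vanishes at `beta_star`, which is what allows the minimization over the kernel weights to recover the original regularizer.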

The regularization in Eq. (1) is known as the elastic-net regularization [19].
In the context of MKL, Shawe-Taylor [13] proposed a similar approach that
uses the square of the linear sum of norms in Eq. (2). Both Shawe-Taylor's and
our approach use mixed ($\ell_1$- and $\ell_2$-) regularization on the weight vector (or its
non-parametric version) in the hope of curing the over-sparseness of $\ell_1$-MKL.

There are alternative approaches that apply non-$\ell_1$ regularization on the
kernel weights $\beta_m$. Longworth and Gales used a combination of an $\ell_1$-norm
constraint and $\ell_2$-norm penalization on the kernel weights. Kloft et al. [5]
proposed to regularize the $\ell_p$-norm of the kernel weights (see also [4]). Our
approach (and [11]) differs from [8] in that we can obtain different levels of sparsity
for all $\lambda < 1$ (see the bottom row of Fig. 1), whereas for all $p > 1$ the resulting
kernel-weight spectrum is dense in [8]. Note also that uniformly-weighted MKL
(an infinite parameter value in [11] and $p = \infty$ in [8]) corresponds to $\lambda = 1$ in our
approach, which may be an advantage of our approach.

3 Results 
3.1 Real data 

We computed 1,760 kernel functions on 10 binary image classification
problems (between every pair of "anchor", "ant", "cannon", "chair", and
"cup") from the Caltech 101 dataset [5]. The kernel functions were constructed as
combinations of the following four factors in the preprocessing pipeline:

• Four types of SIFT features, namely hsvsift (adaptive scale), sift (adaptive
scale), sift (scale fixed to 4px), sift (scale fixed to 8px). We used the
implementation by van de Sande et al. [17]. The local features were
sampled uniformly (grid) from each input image. We randomly chose
200 local features and assigned visual words to every local feature using
these 200 points as cluster centers.

• Local histograms obtained by partitioning the image into rectangular cells
of the same size in a hierarchical manner; i.e., level-0 partitioning has 1
cell (whole image), level-1 partitioning has 4 cells, and level-2 partitioning
has 16 cells. From each cell we computed a kernel function by measuring
the similarity of the two local feature histograms computed in the same
cell from two images. In addition, the spatial-pyramid kernel [3, 10], which
combines these kernels by exponentially decaying weights, was computed.
In total, we used 22 kernels (= one level-0 kernel + four level-1 kernels +
16 level-2 kernels + one spatial-pyramid kernel). See also [5] for a similar
approach.

• Two kernel functions (similarity measures). We used the Gaussian kernel
for 10 band-width parameters ($\gamma$'s) linearly spaced between 0.1 and 5 and
the $\chi^2$-kernel for 10 band-width parameters ($\gamma$'s) linearly spaced between
0.1 and 10, where $q(x)$ and $q(x')$ are the histograms (vectors with entries in
$\mathbb{N}$) computed in some region of two images $x$ and $x'$.

The combination of 4 SIFT features, 22 spatial regions, 2 kernel functions, and
10 parameters resulted in 1,760 kernel functions in total.
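The count of 1,760 can be reproduced by enumerating the configurations, as in the sketch below. The two kernel functions use common textbook parametrizations, which are assumptions here and may differ from the exact forms used in the experiments; the configuration labels are hypothetical names.

```python
import itertools
import numpy as np

def gaussian_kernel(q1, q2, gamma):
    # ASSUMED form: exp(-||q1 - q2||^2 / (2 * gamma^2)).
    d2 = np.sum((q1 - q2) ** 2)
    return float(np.exp(-d2 / (2.0 * gamma ** 2)))

def chi2_kernel(q1, q2, gamma):
    # ASSUMED form: exp(-gamma * sum_j (q1_j - q2_j)^2 / (q1_j + q2_j)),
    # skipping bins that are empty in both histograms.
    s = q1 + q2
    mask = s > 0
    d = np.sum((q1[mask] - q2[mask]) ** 2 / s[mask])
    return float(np.exp(-gamma * d))

# Hypothetical labels for the factors described in the list above.
features = ["hsvsift-adaptive", "sift-adaptive", "sift-4px", "sift-8px"]
regions = (["level0"]
           + [f"level1-{i}" for i in range(4)]
           + [f"level2-{i}" for i in range(16)]
           + ["pyramid"])
kernel_grid = {
    "gaussian": np.linspace(0.1, 5.0, 10),   # 10 band-widths in [0.1, 5]
    "chi2": np.linspace(0.1, 10.0, 10),      # 10 band-widths in [0.1, 10]
}

bank = [(f, r, name, gamma)
        for f, r in itertools.product(features, regions)
        for name, gammas in kernel_grid.items()
        for gamma in gammas]
n_kernels = len(bank)   # 4 features * 22 regions * 2 kernel types * 10 widths
```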

Figure 1 shows the average classification accuracy and the number of active
kernels obtained at different values of the trade-off parameter $\lambda$. We can see
that sparse MKL ($\lambda = 0$) can be significantly outperformed by simple
uniformly-weighted MKL ($\lambda = 1$) when the number of samples ($N$) is small. As the number
of samples grows, the difference between the two cases decreases. Moreover, the
best accuracy is obtained at increasingly sparse solutions as the number of
samples grows larger.

Figure 1: Image classification results from the Caltech 101 dataset. The trade-off
parameters $\lambda$ that achieve the highest test accuracy are marked by stars.

3.2 Simulated data

In order to explain the results from the image-classification dataset in a simple
setting, we generated three toy problems. In the first problem we placed one
Gaussian kernel over each input variable that was independently sampled from
the standard normal distribution. The number of input variables was 100. We
call this setting Feature selection. In the second problem we increased the
variety of kernels by introducing 12 kernels with different band-widths on each
input variable. The number of input variables was 10. We call this setting
Feature & Parameter selection. In the third problem, we used the same 12
kernel functions with different band-widths but jointly over the same set of 10
input variables. We call this setting Parameter selection. The true
kernel-weight spectrum $(\beta_m)_{m=1}^M$ was changed from sparse (only two non-zero $\beta_m$'s)
through medium-dense (exponentially decaying spectrum) to dense (uniform spectrum).
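The three candidate-kernel sets can be sketched as follows. The Gaussian parametrization, the band-width grid, the sample size, and the use of standard normal inputs in the second and third settings are all assumptions made for illustration; only the numbers of variables and kernels are taken from the description above.

```python
import numpy as np

def gaussian_gram(X, bandwidth):
    # Gram matrix of exp(-||x - x'||^2 / (2 * bandwidth^2)) (assumed form).
    sq = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def kernel_bank(X, bandwidths, per_variable):
    # One Gram matrix per (variable, band-width) pair, or one per band-width
    # computed jointly over all variables.
    if per_variable:
        return [gaussian_gram(X[:, [j]], s)
                for j in range(X.shape[1]) for s in bandwidths]
    return [gaussian_gram(X, s) for s in bandwidths]

rng = np.random.default_rng(0)
N = 20                                   # hypothetical sample size
widths = np.logspace(-1, 1, 12)          # hypothetical grid of 12 band-widths

feature_sel = kernel_bank(rng.standard_normal((N, 100)), [1.0], per_variable=True)
feature_param_sel = kernel_bank(rng.standard_normal((N, 10)), widths, per_variable=True)
param_sel = kernel_bank(rng.standard_normal((N, 10)), widths, per_variable=False)
```

Under these assumptions the three settings contain 100, 120, and 12 candidate kernels respectively.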

Figure 2: Classification accuracy obtained from the simulated datasets. The
magenta colored curves with stars denote the values of the trade-off parameter $\lambda$
that yield the highest test accuracy.

Figure 2 shows the test classification accuracy obtained from training the
proposed elastic-net MKL model on nine toy problems with different goals and
different true kernel-weight spectra. We chose the best regularization constant
$C$ for each plot. First, we can observe that when the goal is to choose a subset of
kernels from independent data sources (top row), the best trade-off parameter $\lambda$
is mostly determined by the true kernel-weight spectrum; i.e., small $\lambda$ for a sparse
and large $\lambda$ for a dense spectrum. Remarkably, sparse MKL ($\lambda = 0$) performs
well even when the number of samples is smaller than that of kernels if the
true kernel-weight spectrum is sparse. On the other hand, if we also consider
the selection of kernel parameters through MKL (middle row), the best trade-off
parameter $\lambda$ is often obtained in between zero and one and seems to depend
less on the true kernel-weight spectrum. This finding seems to be consistent
with the observation in [19] that the elastic-net ($0 < \lambda < 1$) performs well when
the input variables are linearly dependent, because kernels that differ only in
the band-width can depend strongly on each other. Furthermore, if
we consider the selection of kernel parameters only (bottom row), the accuracy
becomes almost flat for all $\lambda$ regardless of the true kernel-weight spectrum. The
behaviour in the Caltech dataset seems to be most similar to the second column
of the second row (feature & parameter selection under medium sparsity).
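The dependence among kernels that differ only in band-width can be made concrete with a small numerical check. The sketch below uses the normalized Frobenius inner product between Gram matrices as one crude proxy for dependence; this measure, the Gaussian parametrization, and the chosen band-widths are assumptions for illustration and are not part of the experiments reported here.

```python
import numpy as np

def gaussian_gram(x, bandwidth):
    # Gram matrix of a Gaussian kernel on one scalar variable (assumed form).
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def alignment(K1, K2):
    # Normalized Frobenius inner product <K1, K2> / (||K1|| * ||K2||).
    return float(np.sum(K1 * K2) / (np.linalg.norm(K1) * np.linalg.norm(K2)))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))          # two independent input variables

# Same variable, neighboring band-widths.
same_var = alignment(gaussian_gram(X[:, 0], 1.0), gaussian_gram(X[:, 0], 1.5))
# Same band-width, different (independent) variables.
diff_var = alignment(gaussian_gram(X[:, 0], 1.0), gaussian_gram(X[:, 1], 1.0))
```

In this toy check the two kernels on the same variable are much more strongly aligned than the kernels on independent variables, which is the kind of dependence the elastic-net argument refers to.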

4 Summary 

In this paper, we have empirically investigated the trade-off between sparse
and uniformly-weighted MKL using the elastic-net type regularization term for
MKL. The sparsity of the solution is modulated by changing the trade-off
parameter $\lambda$. We consistently found that (a) the uniformly-weighted MKL
($\lambda = 1$) often outperforms sparse MKL ($\lambda = 0$); (b) the difference between the two
cases decreases as the number of samples increases; (c) when the input kernels
are independent, sparse MKL seems to be favorable if the true kernel-weight
spectrum is not too dense; (d) when the input kernels are linearly dependent
(e.g., kernels with neighboring parameter values are included), an intermediate
value of $\lambda$ seems to be favorable. We have also observed that as the number of
samples increases, a sparser solution (small $\lambda$) is preferred. It was also observed
(results not shown) that a sparser solution is preferred when the noise in the
training labels is small.